Introduction

This project is available on GitHub.

The data is fictitious and is produced by the generate R script.

Challenges built in

Biographical Data

The biographical data has 14 variables and 100,000 observations. The data is stored at the donor level: each row represents a unique donor and holds biographical information about that donor.

Numeric Variables

There are 4 numeric variables:

  • id: A seven-digit numeric ID that is unique to each donor.
  • household_id: A seven-digit numeric ID that is unique to each household. More than one donor may share a household_id.
  • lat: The latitude of the center point of each donor’s ZIP code. Missing for donors residing outside the United States.
  • lon: The longitude of the center point of each donor’s ZIP code. Missing for donors residing outside the United States.
## Rows: 100,000
## Columns: 4
## $ id           <dbl> 8275707, 2963581, 4302254, 7637444, 9369155, 1026439, 65…
## $ household_id <dbl> 1000235, 1000235, 1000303, 1000341, 1000341, 1000435, 10…
## $ lat          <dbl> 34.03, 41.29, NA, 36.07, 26.23, 33.60, 40.99, 38.82, 32.…
## $ lon          <dbl> -117.75, -92.63, NA, -94.15, -80.13, -117.71, -74.34, -7…
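
The properties described above (unique id, shared household_id, and missing coordinates outside the United States) can be checked with a few lines of base R. The sketch below is illustrative only: `bio` is a hypothetical three-row data frame built from the first values printed in this document, not the full data set, and pairing those rows with `country` values is an assumption.

```r
# Hypothetical toy sample copied from the first rows printed in this document.
bio <- data.frame(
  id           = c(8275707, 2963581, 4302254),
  household_id = c(1000235, 1000235, 1000303),
  lat          = c(34.03, 41.29, NA),
  lon          = c(-117.75, -92.63, NA),
  country      = c("United States", "United States", "China"),
  stringsAsFactors = FALSE
)

# id should be unique: anyDuplicated() returns 0 when there are no repeats.
ids_unique <- anyDuplicated(bio$id) == 0

# Households shared by more than one donor.
household_size <- table(bio$household_id)
shared_households <- names(household_size)[household_size > 1]

# Coordinates should be missing for donors outside the United States.
outside_us <- bio$country != "United States"
coords_missing_outside_us <- all(is.na(bio$lat[outside_us]) & is.na(bio$lon[outside_us]))
```

On the full data, the same checks would be run against the loaded data frame in place of `bio`.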

Character Variables

When the data is loaded with default settings, there are 9 character variables:

  • name: Each donor’s first and last name formatted as “last name, first name”.
  • country: Each donor’s country of residence.
  • city: Each donor’s city of residence.
  • deceased: A binary indicator of whether a donor is deceased (“Y”|“N”).
  • zip: The five-digit ZIP code of donors whose country of residence is the United States.
  • state: The two-letter state abbreviation for each donor whose country of residence is the United States.
  • capacity: Each donor’s capacity represented within an estimated range.
  • capacity_source: A categorical variable indicating how the capacity was determined (“institutional”|“screening”).
  • race: A categorical variable indicating the donor’s race.
## Rows: 100,000
## Columns: 9
## $ name            <chr> "al-Shakoor, Labeeb", "Nero, Brianna", "al-Rasheed, R…
## $ country         <chr> "United States", "United States", "China", "United St…
## $ city            <chr> "Pomona", "Oskaloosa", "Shenzhen", "Fayetteville", "P…
## $ deceased        <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"…
## $ zip             <chr> "91766", "52577", NA, "72701", "33069", "92653", "074…
## $ state           <chr> "CA", "IA", NA, "AR", "FL", "CA", "NJ", "VA", "TX", N…
## $ capacity        <chr> "$50k - $75K", "$1k - $2.5k", "$5k - $10k", "$2.5k - …
## $ capacity_source <chr> "screening", "screening", "screening", "institutional…
## $ race            <chr> "Non-Hispanic white", "Non-Hispanic white", "Asian", …
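
Several of these character variables encode structured information, so a common first step is to derive more convenient columns from them. The following is a minimal base R sketch using values copied from the output above. It assumes that names always contain a single ", " separator and that capacity ranges follow the "$Xk - $Yk" pattern shown here; other patterns may exist in the full data, and the helper `to_dollars` is hypothetical.

```r
# Values copied from the printed output above (truncated entries are left out).
name     <- c("al-Shakoor, Labeeb", "Nero, Brianna")
deceased <- c("Y", "Y")
capacity <- c("$50k - $75K", "$1k - $2.5k", "$5k - $10k")

# Split "last name, first name" into two vectors.
name_parts <- strsplit(name, ", ", fixed = TRUE)
last_name  <- vapply(name_parts, function(p) p[1], character(1))
first_name <- vapply(name_parts, function(p) p[2], character(1))

# Convert the "Y"/"N" indicator to a logical vector.
is_deceased <- deceased == "Y"

# Hypothetical helper: turn a string such as "$2.5k" or "$75K" into dollars.
# Only the "k" suffix (either case) is handled; this is an assumption.
to_dollars <- function(x) {
  x <- toupper(x)
  amount <- as.numeric(gsub("[^0-9.]", "", x))
  multiplier <- ifelse(grepl("K", x, fixed = TRUE), 1e3, 1)
  amount * multiplier
}

# Split each range on " - " and convert both ends.
range_parts    <- strsplit(capacity, " - ", fixed = TRUE)
capacity_lower <- vapply(range_parts, function(p) to_dollars(p[1]), numeric(1))
capacity_upper <- vapply(range_parts, function(p) to_dollars(p[2]), numeric(1))
```

Note that the sample mixes "k" and "K" in the same value, which is why the helper normalizes case before matching.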

country

deceased

state

capacity

capacity_source

race

Date Variables

There is 1 date variable:

  • birthday: The date of each donor’s birth stored as a date variable.
## Rows: 100,000
## Columns: 1
## $ birthday <date> 1923-11-18, 1925-03-18, 1924-08-28, 1923-05-14, 1921-10-11,…
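
Because birthday is stored as a date, a donor's age can be derived directly from it. The sketch below computes age in completed years using base R. The reference date `as_of` is a hypothetical choice made so that the result is reproducible, and the birthdays are the first three values printed above.

```r
# First three birthdays from the printed output above.
birthday <- as.Date(c("1923-11-18", "1925-03-18", "1924-08-28"))

# Hypothetical fixed reference date; replace with Sys.Date() for the current age.
as_of <- as.Date("2020-01-01")

b <- as.POSIXlt(birthday)
r <- as.POSIXlt(as_of)

# TRUE when the birthday has not yet occurred in the reference year.
before_birthday <- (r$mon < b$mon) | (r$mon == b$mon & r$mday < b$mday)

# Difference in calendar years, minus one if the birthday is still ahead.
age <- (r$year - b$year) - as.integer(before_birthday)
```

Comparing month and day explicitly avoids the small errors that come from dividing a day count by 365.25.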